Indexation of Document Images Using Frequent Items

Authors

  • Eugen Barbu
  • Pierre Héroux
  • Sébastien Adam
  • Éric Trupin
Abstract

Documents exist in different formats. When we have document images, we must deploy a document image analysis application in order to access some, and preferably all, of the information contained in those images. Document images can be mostly textual or mostly graphical. If a user's task is to retrieve, from a set, the document images relevant to a query, we must use indexing techniques. The documents and the query are translated into a common representation. Using a dissimilarity measure (between the query and the document representations) and a method to speed up the search process, we may find documents that are, from the user's point of view, relevant to the query. The semantic gap between a document representation and the user's implicit representation can lead to unsatisfactory results. If we want to access objects in document images that are relevant to the document semantics, we must enter a document understanding cycle. Document image understanding is usually performed by domain-dependent systems that are not applicable in general cases (textual and graphical document classes). In this paper we present a method to describe and then index document images using frequently occurring items. The intuition is that frequent items represent symbols in a certain domain, and that this document description can be related to the domain knowledge (in an unsupervised manner). The novelty of our method consists in using graph summaries as a description for document images. In our approach we use a bag (multiset) of graphs as the description of a document image. From the document images we extract a graph-based representation. In these graphs, we apply graph mining techniques in order to find frequent and maximal subgraphs. For each document image we construct a bag with all frequent subgraphs found in the graph-based representation. This bag of “symbols” represents the description of the document.
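
The bag-of-frequent-items description and the dissimilarity-based retrieval outlined in the abstract can be illustrated with a small sketch. It is not the authors' implementation: single labeled edges stand in for the frequent subgraphs that a real graph mining step would return, support is counted as the number of documents containing an item, and the dissimilarity is one minus a multiset Jaccard index. All of these choices, along with the function names and the toy data, are assumptions made for illustration only.

```python
from collections import Counter


def edge_items(graph):
    """Canonical single-edge patterns of a labeled, undirected graph.

    `graph` is a list of (node_label, edge_label, node_label) triples.
    Single edges are a simplifying stand-in for mined subgraphs.
    """
    return [(min(a, b), e, max(a, b)) for a, e, b in graph]


def frequent_items(graphs, min_support):
    """Items occurring in at least `min_support` graphs (assumed support)."""
    support = Counter()
    for g in graphs:
        support.update(set(edge_items(g)))  # count each item once per graph
    return {item for item, count in support.items() if count >= min_support}


def bag_of_items(graph, frequent):
    """Multiset of the frequent items found in one document graph."""
    return Counter(item for item in edge_items(graph) if item in frequent)


def dissimilarity(bag_a, bag_b):
    """One minus the multiset Jaccard index (an assumed measure)."""
    union = sum((bag_a | bag_b).values())
    if union == 0:
        return 0.0
    return 1.0 - sum((bag_a & bag_b).values()) / union


def rank(query_bag, doc_bags):
    """Document names ordered from most to least similar to the query."""
    return sorted(doc_bags, key=lambda name: dissimilarity(query_bag, doc_bags[name]))


# Hypothetical toy collection: two line drawings and one textual page.
docs = {
    "plan_a": [("corner", "line", "corner"),
               ("corner", "line", "junction"),
               ("corner", "line", "junction")],
    "plan_b": [("junction", "line", "corner"),
               ("corner", "line", "corner")],
    "letter": [("word", "right_of", "word"),
               ("word", "below", "word")],
}

frequent = frequent_items(docs.values(), min_support=2)
bags = {name: bag_of_items(g, frequent) for name, g in docs.items()}
ranking = rank(bags["plan_a"], bags)
print(ranking)
```

With this toy data, only the two line-drawing patterns reach the support threshold, so the textual page receives an empty bag and ranks last for a line-drawing query. A real system would replace `edge_items` with a frequent and maximal subgraph miner and would add an index structure to avoid comparing the query against every document.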

Similar resources

Clustering Document Images Using Graph Summaries

Document image classification is an important step in document image analysis. Based on classification results we can tackle other tasks such as indexation, understanding or navigation in document collections. Using a document representation and an unsupervised classification method, we can group documents that, from the user's point of view, constitute valid clusters. The semantic gap between a do...

Frequent Graph Discovery: Application to Line Drawing Document Images

In this paper a sequence of steps is applied to a graph representation of line drawings using concepts from data mining. This process finds frequent subgraphs and then association rules between these subgraphs. The distant aim is the automatic discovery of symbols and their relations, which are parts of the document model. The main outcome of our work is firstly an algorithm that finds frequent...

A proposition of a robust system for historical document images indexation

Characterizing noisy or ancient documents remains a challenging problem. Many techniques have been proposed for feature extraction and image indexation of such documents. Global approaches are in general less robust and exact than local approaches. We therefore propose in this paper a hybrid system based on a global approach (fractal dimension) and a local one, based on S...

Removing Geometric Distortion from Text Using the Geometric Information of Text Lines

Document images produced by scanners or digital cameras usually have photometric and geometric distortions. If either of these effects distorts the document, recognition of words from such a document image using OCR is subject to errors. In this paper we propose a novel approach to significantly reduce geometric distortion in document images. In this method, we first extract document lines from do...

Ancient Printed Documents Indexation: A New Approach

Based on the study of the specificity of historical printed books and on the main error sources of classical methods of page layout analysis, this paper presents a new way to achieve an indexation of ancient printed documents. We have developed an approach based on the extraction and the quantification of the various orientations that are present in printed document images. The documents are in...

Publication date: 2005